Frontiers in Digital Health
Frontiers Media SA
Preprints posted in the last 30 days, ranked by how well they match Frontiers in Digital Health's content profile, based on 20 papers previously published here. The average preprint has a 0.05% match score for this journal, so anything above that is already an above-average fit.
Tran, B. D.; Hu, D.; Kim, S.; Guo, Y.; Mangu, R.; Reynolds, T. L.; Lafata, J. E.; Tai-Seale, M.; Zheng, K.
Ambient clinical intelligence (ACI) systems use automatic speech recognition (ASR) to capture patient-provider conversations for downstream clinical documentation. However, many ASR evaluations are conducted under controlled conditions using specialized hardware. We evaluated how recording devices influence transcription performance of contemporary ASR engines applied to clinical dialogue. Thirty-five primary care encounters were re-enacted from transcribed conversations and recorded using five devices simultaneously: smartphone, laptop microphone, portable recorder, clip-on microphone, and a desktop microphone. Six ASR engines were evaluated using word error rate (WER), clinical concept extraction precision and recall, and sentence-level semantic similarity. Median WER ranged from 16.7% to 20.7% across engines. Engine choice produced larger variation in transcription performance than recording device, although device-related differences were statistically significant. Overall, contemporary ASR engines demonstrated relative robustness to consumer-grade recording hardware, suggesting that model selection may have greater impact on transcription performance than recording device configuration in real-world ACI deployments.
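Word error rate, the primary metric in the abstract above, can be sketched as a word-level edit distance divided by the reference length. This is a minimal illustration only: the lowercasing and whitespace tokenization are assumptions, and the study's own text normalization is not described in the abstract.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count."""
    # Assumed normalization: lowercase and split on whitespace.
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] is the edit distance between the reference words seen so far
    # (excluding the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion (reference word missing)
                curr[j - 1] + 1,     # insertion (extra hypothesis word)
                prev[j - 1] + cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    ref = "the patient takes metformin twice daily"
    hyp = "the patient take metformin daily"
    print(f"WER = {word_error_rate(ref, hyp):.3f}")
```

Because insertions count as errors, WER can exceed 1.0 when the hypothesis is much longer than the reference.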
Svihrova, R.; Marzorati, D.; Odello, T.; Monachino, G.; Staletti, T.; Tieben, R.; Luigies, R.; Bodewes, N.; Rutten, W.; Barrett, G.; Bhogal, A.; Wilkinson, T.; Tzovara, A.; Faraci, F. D.
Cardiac rehabilitation is critical for secondary prevention, yet long-term adherence remains low. We present CUOREMA, a new personalized mobile health system integrating self-monitoring diaries, wearable data, virtual coaching, and reinforcement learning-enhanced adaptive interventions to support lifestyle change during and after outpatient cardiac rehabilitation. In a six-month two-center feasibility study (N = 53, Switzerland and France), we evaluated usability, engagement patterns, and preliminary health-related outcomes. Attrition was high: only 19% of participants used the app on more than 100 days, and questionnaire response rates declined from 55% at baseline to 13% at six months. Despite these limitations, exploratory data-driven analysis revealed three distinct engagement clusters (dropout, sporadic, and consistent), which were further supported by matching patterns in app component usage, medication diary adoption, and smartwatch wearing time. Engagement clusters were not associated with demographic factors; instead, psychological themes of patients' personal goals suggested that intrinsic motivation characterized sustained users, whereas extrinsic motivation predominated among early dropouts. User experience was rated positively, and validated questionnaire scores showed no deterioration over time. One center demonstrated a statistically significant improvement in 6-minute walking test performance, though the study was not powered to detect clinical outcomes and selective dropout cannot be ruled out. These findings highlight engagement variability as a central challenge in digital cardiac rehabilitation and suggest that tailoring interventions to individual motivational profiles may improve long-term adherence.
Hu, D.; Flores, D.; Flores, L.; Chien, R.; Lam, K.; Chow, E.; Guo, Y.; Tam, S.; Perret, D.; Pandita, D.; Zheng, K.
Ambient AI documentation systems rely on automatic speech recognition to transcribe patient-provider conversations before generating clinical notes. However, little empirical evidence exists on how these systems perform in mixed-language clinical encounters. We conducted a mixed-method heuristic evaluation of an ambient AI documentation tool using 24 reenacted primary care conversations involving Spanish-English and Mandarin-English code-switching. Quantitative analyses measured mixed error rate (MER) and code-switching detection. Overall MER was low, with a median of 4% and less variation in Spanish-English conversations, and 9% in Mandarin-English conversations, but with outliers reaching 67%. The system generally detected language switches reliably, although deletions occurred frequently in Mandarin-English transcripts at switch points. Qualitative analysis revealed transcription errors related to phonetic similarity, automatic language translation, clinical terminology recognition, and language-specific challenges. These findings highlight considerations for improving ambient AI clinical documentation systems to support multilingual providers in delivering care for linguistically diverse populations.
Appiagyei, J. B.; Otu, R. O.; Henry, M. K.; Casterline, B. W.; Becevic, M.
Teledermatology expands access to dermatologic expertise in rural settings, yet diagnostic uncertainty persists in low-resource primary care. This retrospective study evaluated MedGemma-4B-IT, a compact multimodal vision-language model, as adjunctive clinical decision support for challenging diagnostic cases. We analyzed 77 zero-concordance cases (360 clinical photographs) from a Dermatology Extension for Community Healthcare Outcomes (ECHO) tele-mentoring program (2016-2021). Zero-concordance cases showed no overlap between primary clinician provisional diagnosis and dermatologist-confirmed diagnosis. The model was prompted using dermatologist-style format to generate ranked differential diagnoses. Performance was assessed using strict case-level top-k exact-match accuracy and relaxed matching criteria based on fuzzy string similarity. MedGemma achieved 0.0% strict top-1 accuracy, 1.3% top-3 accuracy, 3.9% top-5 accuracy, and 3.9% top-10 accuracy. Relaxed concept-level matching achieved 28.6% top-1, 63.6% top-5, and 67.5% top-10 accuracy. Image-level accuracy was 44.2% (159/360, 95% CI 39.0-49.5%). The model surfaced the correct diagnosis within differential lists in 45.5% of cases despite no exact top-1 matches, suggesting utility for differential expansion rather than definitive diagnosis. Performance varied across diagnostic categories, with highest accuracy in Other categories (54.5%) and lowest in neoplastic conditions (0.0%). Common errors included confusion between inflammatory and other diagnostic groupings. These findings characterize MedGemma performance on real-world teledermatology cases and inform safe, clinician-in-the-loop integration into teledermatology workflows where specialist oversight remains essential.
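The gap between strict and relaxed top-k accuracy reported above can be illustrated with a short sketch. The abstract does not state which fuzzy similarity measure or threshold was used, so the `difflib.SequenceMatcher` ratio, the `threshold` parameter, and the example diagnoses below are all assumptions made for illustration.

```python
from difflib import SequenceMatcher


def _normalize(text: str) -> str:
    """Lowercase and collapse whitespace before comparison."""
    return " ".join(text.lower().split())


def hit_at_k(ranked, reference, k, threshold=1.0):
    """True if any of the first k candidates matches the reference.

    threshold=1.0 means exact match after normalization (strict);
    lower values allow approximate string matches (relaxed).
    """
    ref = _normalize(reference)
    return any(
        SequenceMatcher(None, _normalize(candidate), ref).ratio() >= threshold
        for candidate in ranked[:k]
    )


def top_k_accuracy(cases, k, threshold=1.0):
    """cases: list of (ranked_differential, confirmed_diagnosis) pairs."""
    hits = sum(hit_at_k(ranked, ref, k, threshold) for ranked, ref in cases)
    return hits / len(cases)


if __name__ == "__main__":
    # Hypothetical cases, not taken from the study.
    cases = [
        (["atopic dermatitis", "psoriasis"], "psoriasis vulgaris"),
        (["tinea corporis"], "tinea corporis"),
    ]
    print("strict top-2:", top_k_accuracy(cases, k=2))
    print("relaxed top-2:", top_k_accuracy(cases, k=2, threshold=0.6))
```

A string-similarity threshold treats "psoriasis" and "psoriasis vulgaris" as the same concept, which is one way strict accuracy can sit near zero while relaxed accuracy is much higher.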
Dobbins, D.; Russell, A.; Gunther, M.; Shetty, V.; Shomali, A.; Vawdrey, D.; Waring, S.; Whary, P.; Wong, J.; Wright, E. A.; Olson, A. W.
Objectives: Older adults with comorbidities and polypharmacy have disproportionately high risk of hospitalization as well as readmission from adverse drug events (ADEs), of which 28%-71% are preventable (pADEs). This paper introduces an LLM application, CommunicADE, designed to support risk-mitigation of pADE-related readmission for the aforementioned population. We aim to evaluate CommunicADE's technical performance with OpenAI's HealthBench criteria: accuracy, completeness, communication quality, context awareness, and instruction following. Materials and Methods: Our technical validation study used an LLM (KimiK2.5) to simulate interviews between CommunicADE and nine high-fidelity synthetic patients hospitalized and at increased risk for pADE-related readmission (65+ years, comorbidities, 5+ medications). Some pADE risk mechanism clues were visible to CommunicADE in patient H&Ps, but most mechanisms were solely discoverable in interviews. Two pharmacists evaluated CommunicADE's interview questions and EHR notes with HealthBench-informed variables. Analyses used descriptive statistics. Results: For 35 mechanisms across 9 patients (avg=3.89 mechanisms/patient), CommunicADE's precision and recall were 0.92 and 0.63, respectively. Hallucinations were absent. Coherence and person-centeredness scored 4.28 and 4.44 on a 5-point scale (5=highest). On average, communication was at a 5th grade level and objective for 78% of patients. Most patient-reported quotes included in notes (92%) supported detected mechanisms. CommunicADE followed all instructions regarding interview length and patient approvals. Discussion: CommunicADE's strongest performance was in accuracy (precision, hallucinations), communication quality (coherence, readability), context awareness (person-centeredness). Completeness (recall) and instruction following (objectivity, pADE mechanism/quote alignment) show room for improvement.
Conclusion: Findings suggest technical readiness for a feasibility pilot with real-world patients, and key areas for performance improvement.
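The precision and recall figures in the abstract above follow from true-positive, false-positive, and false-negative counts. The abstract reports only the 35 reference mechanisms and the two ratios, so the counts in this sketch (22 detected correctly, 2 detected incorrectly, 13 missed) are hypothetical values chosen because they round to the reported 0.92 and 0.63; they are not taken from the paper.

```python
def precision_recall(true_pos: int, false_pos: int, false_neg: int):
    """Return (precision, recall) from detection counts."""
    precision = true_pos / (true_pos + false_pos)
    recall = true_pos / (true_pos + false_neg)
    return precision, recall


if __name__ == "__main__":
    # Hypothetical counts consistent with 35 reference mechanisms.
    tp, fp, fn = 22, 2, 13
    assert tp + fn == 35
    precision, recall = precision_recall(tp, fp, fn)
    print(f"precision = {precision:.2f}, recall = {recall:.2f}")
```

Under these assumed counts, high precision with lower recall means the detected mechanisms were almost always correct but roughly a third of the reference mechanisms went undetected.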
Bermejo-Pelaez, D.; Darias, O.; Pastor, L.; Valles, R.; Diez, N.; Lin, L.; Garcia-Villena, J.; Cuadrado, D.; Vladimirov, A.; Alamo, E.; Postigo, M.; Rodriguez-Dominguez, M.; Canton, R.; Rodriguez-Tudela, J. L.; Alastruey Izquierdo, A.; Bohorquez, L. C.; Rubio, J. M.; Dacal, E.; Luengo-Oroz, M.
Introduction. Lateral flow assays (LFAs) are indispensable rapid diagnostic tools in healthcare, enabling point-of-care diagnosis critical for patient management and supporting disease burden assessment and surveillance when results are properly recorded. However, misinterpretation errors and unreported cases remain a concern. A quality-assured, affordable AI-powered tool that supports decision-making during result interpretation could promote proper disease monitoring and epidemiological surveillance. Here, we describe the performance of a universal AI model to digitize and interpret results from multiple LFA types through a smartphone application, a step that could ultimately enable standardized and digitally reportable test outcomes. Methods. The AI algorithm was evaluated in 17 LFA types, including both 2-band and 3-band tests for different diseases and manufacturers. The model was trained on a dataset of 22,576 images captured under diverse lighting conditions with different smartphone models and using a custom mobile application, TiraSpot (Spotlab, Madrid, Spain). To assess generalizability, a leave-one-out cross-validation was applied, wherein each LFA type was iteratively excluded from training and used for testing. Model performance was evaluated using bootstrapping on the inference dataset. Results. In the assessment of the model's ability to generalize to new LFA types not previously analyzed (not included during development), the model achieved an overall AUC of 94.3% for second band detection. This overall performance was enhanced to 99.3% (Sensitivity=98.6%; Specificity=98%) after training with 50 images of each LFA type, highlighting the benefit of additional data for specific LFA types. For the third band detection, where less training data was available, the system achieved an overall AUC of 83.9% for unseen LFAs, improving to 94.2% (Sensitivity=92.9%; Specificity=87.9%) after training with 50 images of each LFA type. Conclusion.
This system demonstrates the feasibility of an AI-powered universal digital reader for interpreting LFA results from diverse test types using smartphone-captured images. Its compatibility with standard smartphones makes it a universal tool, enabling reliable LFA interpretation across devices and settings. By standardizing test interpretation and digitizing results, this tool could support decision making in result interpretation, enhancing epidemiological surveillance, particularly in resource-limited settings. Its adaptability across various infections highlights its potential to improve diagnostic consistency and support disease management in diverse healthcare settings.
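The leave-one-out scheme described in the Methods holds out every image of one LFA type per fold, so each test set contains only a test type the model has not seen. A minimal sketch of that split follows; the record layout and the `lfa_type` and `image` field names are hypothetical, and model training and bootstrapping are omitted.

```python
from collections import defaultdict


def leave_one_type_out(samples):
    """Yield (held_out_type, train, test) with one LFA type held out per fold.

    samples: iterable of dicts carrying an "lfa_type" key (assumed layout).
    """
    by_type = defaultdict(list)
    for sample in samples:
        by_type[sample["lfa_type"]].append(sample)

    for held_out in sorted(by_type):
        test = by_type[held_out]
        # Training data is every image whose type differs from the held-out one.
        train = [
            sample
            for lfa_type, group in by_type.items()
            if lfa_type != held_out
            for sample in group
        ]
        yield held_out, train, test


if __name__ == "__main__":
    demo = [
        {"lfa_type": "A", "image": "a1.jpg"},
        {"lfa_type": "A", "image": "a2.jpg"},
        {"lfa_type": "B", "image": "b1.jpg"},
        {"lfa_type": "C", "image": "c1.jpg"},
    ]
    for held_out, train, test in leave_one_type_out(demo):
        print(held_out, len(train), len(test))
```

Splitting by type rather than by image is what makes the reported AUC for unseen LFAs a measure of generalization to new test designs instead of to new photographs of familiar ones.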
Kang, W. J.; Sim, J.; Loh, E. E. M.; Lim, A. C. Y.; FOONG, K. W. C.
Importance. Large language models are increasingly explored as clinical decision support tools in orthodontics, yet existing evaluations have been confined to knowledge based question answering where reported accuracy ranges from 18% to 100%. No study has evaluated performance on the computational and classificatory tasks that define daily diagnostic work. Furthermore, 84.3% of published healthcare large language model studies fail to report the number of repeated queries performed, leaving output stochasticity unexamined. Objective. To compare the diagnostic accuracy and output consistency of three frontier reasoning-enhanced large language models, namely, ChatGPT 5.4 (Thinking), Gemini 3 (Thinking), and Claude Opus 4.6 (Extended Thinking), on Bolton analysis, Index of Orthodontic Treatment Need-Dental Health Component (IOTN DHC) classification, space analysis, and lateral cephalometric interpretation. Methods. In this comparative cross-sectional study with a repeated-measures design, each model, accessed through its respective consumer facing web interfaces under default provider settings rather than through application programming interfaces, processed 200 purpose-built items (50 per task) across four independent trials, yielding 2,400 observations. Responses were scored against a pre-established reference standard by two independent raters using strict binary exact match criteria. Accuracy was reported with exact binomial 95% confidence intervals. Inter-model comparisons used Cochran's Q test with post-hoc McNemar's tests and Bonferroni correction. A supplementary context-rich prompting evaluation was conducted on 40 items (480 observations). Results. Claude Opus 4.6 (Extended Thinking) achieved the highest accuracy (99.0%; 95% CI: 96.4 to 99.9%), followed by Gemini 3 (Thinking) (95.5%; 91.6 to 98.1%) and ChatGPT 5.4 (Thinking) (94.0%; 89.8 to 96.9%) (Cochran's Q=6.87, p=0.032). 
Each model exhibited distinct, non-overlapping error profiles concentrated at the normal-abnormal classification boundary. An accuracy-consistency paradox emerged: the most accurate model was the least consistent (93.0%), while the least accurate was the second-most consistent (98.0%). Context rich prompting eliminated all errors across all three models. Interpretation. Frontier reasoning large language models achieved high overall accuracy on orthodontic diagnostic tasks but retained concealed, task-specific vulnerabilities detectable only through repeated-trial evaluation. An accuracy-consistency paradox, in which the most accurate model was the least consistent, demonstrates that single-trial evaluations cannot characterise clinical risk. The reasoning modes were associated with high arithmetic accuracy but did not compensate for imprecise parametric knowledge on classification tasks; however, the absence of a non-thinking baseline means this association cannot be attributed to the thinking mode itself. Context-rich prompting eliminated all errors on synthetic data but should be regarded as a necessary yet insufficient prerequisite for clinical deployment pending prospective validation on real patient data.
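The exact binomial confidence intervals mentioned in the Methods are commonly computed with the Clopper-Pearson construction. The sketch below finds the bounds by bisection on the binomial tail probabilities using only the standard library. The example count of 198 correct out of 200 is inferred from the reported 99.0% accuracy on 200 items and is an assumption, not a figure stated in the abstract.

```python
from math import comb


def binom_cdf(k: int, n: int, p: float) -> float:
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))


def _solve_increasing(g, target: float, iters: int = 100) -> float:
    """Bisection for g(p) = target, where g is increasing on [0, 1]."""
    lo, hi = 0.0, 1.0
    for _ in range(iters):
        mid = (lo + hi) / 2
        if g(mid) < target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


def clopper_pearson(k: int, n: int, alpha: float = 0.05):
    """Exact two-sided (1 - alpha) interval for k successes in n trials."""
    # Lower bound: P(X >= k | p) = alpha / 2.
    lower = 0.0 if k == 0 else _solve_increasing(
        lambda p: 1 - binom_cdf(k - 1, n, p), alpha / 2
    )
    # Upper bound: P(X <= k | p) = alpha / 2, i.e. P(X > k | p) = 1 - alpha / 2.
    upper = 1.0 if k == n else _solve_increasing(
        lambda p: 1 - binom_cdf(k, n, p), 1 - alpha / 2
    )
    return lower, upper


if __name__ == "__main__":
    low, high = clopper_pearson(198, 200)  # assumed count, see lead-in
    print(f"95% CI: {100 * low:.1f}% to {100 * high:.1f}%")
```

The interval is asymmetric around the point estimate near the boundary, which is why an accuracy close to 100% has a much longer lower tail than upper tail.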
Ariel, D.; Grumberg, L. R.; Supakul, S.; Wannasri, S.; Mitchnik, I. Y.; Lev, A.; Ariyamethanon, W.; Agbarieh, M.; Miari, S.; Laban, G.; Hasid, B.
The same patient question can yield different clinical quality across languages. Across 504 forum-derived patient queries in six languages and four chatbots, language-matched clinicians rated responses on five clinical dimensions (1,008 ratings; 5,040 dimension scores). Patient language outweighed chatbot identity across the four clinical-substance dimensions (composite language partial η² = 0.275 vs chatbot 0.035; robust to investigator-rating exclusion: η² = 0.260) but not for empathy (η² = 0.029): clinical substance was language-associated; warmth was relatively preserved. Catastrophic safety ratings ranged 4.3-fold by language (3.6% English, 15.5% Thai and Hebrew); 62% of catastrophic ratings exceeded the English baseline (descriptive disparity). Failures were systematic and silent: none of 24 stroke responses conveyed time-criticality framing, none of 24 CO-poisoning responses challenged the family's stress framing, and 120 sentinel responses contained no confident errors. Warmth did not discriminate clinical danger (response-level empathy AUC = 0.49): consumer health AI can deliver fluent, caring tone with degraded clinical substance.
Reyes Nieva, H.; Flanagan, M.; Huang, S.; Theodore, D. A.; Nkodo, A. F.; Parkinson, M.; Hill, S.; McAndrew, M.; Benitez, J. A.; Peralta, H.; Amesty, S.; Zucker, J. E.; Sobieszczyk, M.; Castor, D.
Background: Long-acting pre-exposure prophylaxis (PrEP) expands HIV prevention options for women. However, PrEP impact depends on addressing persistent gaps in awareness, access, and use. Artificial intelligence (AI) tools, including conversational agents, are being explored to advance PrEP uptake, but comfort with AI may influence their impact. Thus, we examined women's comfort with AI and its association with PrEP awareness. Methods: We analyzed self-reported data from women aged ≥18 years in a cross-sectional survey conducted in New York City from August 2023 to August 2024. We performed descriptive analyses, applied latent class analysis to identify AI knowledge/comfort profiles, and estimated unadjusted and adjusted odds ratios to assess associations between profile membership and PrEP awareness. Results: Among 306 respondents without a diagnosis of HIV who completed AI-related survey items, the median age was 36. Most women identified as Hispanic/Latina (60%) or Non-Hispanic Black (18%), had not completed college (53%), and spoke only English or were bilingual (81%). Latent class analysis identified four AI knowledge/comfort profiles that differed by PrEP awareness, race/ethnicity, borough, prior drug use, and technology utilization. Women with varied AI knowledge, broad AI discomfort, and comfort with clinicians maintaining privacy had lower odds of PrEP awareness (OR: 0.35, 95% CI: 0.16-0.75), but this association did not persist after statistical adjustment. Conclusions: PrEP awareness and AI knowledge were limited, yet many women expressed openness to AI-enabled tools when privacy was assured. AI-enabled HIV prevention tools should prioritize trust, transparency, confidentiality, and the lived contexts of the women they intend to serve.
Benning, L.; Hirsch, A.; Groeschel, M.; Roeschl, T.; Spott, M.; Hans, F. P.; Urban, T.; Busch, H.-J.; Meyer, A.; Madrid, J.
Background Emergency department (ED) triage is a high-stakes clinical decision process that determines patient prioritization and resource allocation under time pressure. Large language models (LLMs) have recently been proposed as decision-support tools for triage, yet most evaluations rely on simulated scenarios or curated datasets. Evidence from real-world clinical environments remains limited. The objective of this project was to systematically evaluate the performance, calibration, and reproducibility of multiple contemporary large language models for Emergency Severity Index (ESI) classification and sectoral allocation (ED vs. urgent care practice, UCP) using a comprehensive real-world triage dataset. Material and Methods This was a retrospective cross-sectional benchmarking study conducted at a tertiary academic ED in Germany with an integrated central point of assessment (CPA). The study included all consecutive adult walk-in encounters (>18 years) presenting between October 2023 and February 2024 (N = 16,107). Data were collected from a structured clinical decision support system capturing presenting complaints, vital signs, and triage decisions recorded by specialized nursing staff. The data comprised structured clinical variables routinely collected at triage, including presenting complaint categories (CEDIS-PCL), vital signs according to the ABCDE framework, and additional structured or free-text clinical information. Results The primary outcome was the agreement between LLM-predicted and nurse-assigned ESI levels measured using quadratic-weighted Cohen's κ. Secondary outcomes included sectoral assignment agreement, misclassification patterns (over- and under-triage), calibration metrics, and output reproducibility. Quadratic-weighted κ values ranged from 0.18 to 0.75 across models. Only a structured stepwise prompting strategy achieved substantial agreement (κ_qw = 0.747), approaching reported human inter-rater reliability.
Most models demonstrated moderate or lower agreement and systematic overconfidence, with expected calibration errors (ECE) based on verbalized confidence ranging from 0.099 to 0.355. Sectoral assignment agreement (i.e. ED vs. urgent care practice, UCP) was uniformly low (κ < 0.30). Reproducibility testing revealed substantial variability in 23% of cases, indicating non-deterministic output behavior for clinically relevant decisions. Conclusions Current large language models demonstrate heterogeneous and generally limited performance in real-world emergency triage tasks. Structured algorithm-guided prompting appears more influential than model architecture or size. Before clinical implementation, improvements in calibration, reliability, and workflow integration are required, alongside regulatory-compliant validation in prospective clinical settings.
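The primary outcome of the triage study above, quadratic-weighted Cohen's kappa, penalizes disagreements by the squared distance between ordinal levels, so a two-level triage error costs four times as much as a one-level error. The sketch below is a plain-Python illustration with levels coded 1 to `n_levels`; the function name, coding, and example labels are assumptions, not the study's implementation.

```python
def quadratic_weighted_kappa(rater_a, rater_b, n_levels: int) -> float:
    """Quadratic-weighted Cohen's kappa for ordinal labels coded 1..n_levels."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("raters must label the same non-empty set of cases")
    n = len(rater_a)

    # Observed confusion matrix (rows: rater A, columns: rater B).
    observed = [[0] * n_levels for _ in range(n_levels)]
    for a, b in zip(rater_a, rater_b):
        observed[a - 1][b - 1] += 1

    row_totals = [sum(row) for row in observed]
    col_totals = [
        sum(observed[i][j] for i in range(n_levels)) for j in range(n_levels)
    ]

    weighted_observed = 0.0
    weighted_expected = 0.0
    for i in range(n_levels):
        for j in range(n_levels):
            weight = (i - j) ** 2 / (n_levels - 1) ** 2
            weighted_observed += weight * observed[i][j]
            # Expected count under independence of the two raters' marginals.
            weighted_expected += weight * row_totals[i] * col_totals[j] / n
    return 1 - weighted_observed / weighted_expected


if __name__ == "__main__":
    # Hypothetical ESI levels (1 = most urgent, 5 = least urgent).
    nurse = [2, 3, 3, 4, 5, 3, 2, 4]
    model = [2, 3, 2, 4, 4, 3, 3, 4]
    print(f"kappa_qw = {quadratic_weighted_kappa(nurse, model, 5):.3f}")
```

The statistic is undefined when both raters assign every case to the same single level, since the expected weighted disagreement is then zero.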
Komolafe, O. O.; Roberts, A. C.; Shelley, J.; Tawiah, A. K.
High-quality, domain-specific datasets are foundational to advancing educational tools and AI systems in healthcare, yet assembling case repositories from real-world clinical records faces substantial privacy, ethical, and licensing barriers. Synthetic data generation offers a compelling pathway forward, but educational cases require rigorous validation to ensure clinical plausibility and pedagogical utility. This pilot study introduces PhysiCase, a dual-layer validation pipeline for synthetic case generation and evaluates the feasibility of combining automated LLM-based screening with expert educator review. We generated 128 synthetic musculoskeletal (MSK) cases using four frontier large language models (GPT-4.1, GPT-4o, Google Gemini 2.5 Pro, and Llama 4 Scout) across 28 clinical conditions. Cases underwent automated quality screening using an "LLM-as-judge" framework (DeepEval) assessing prompt alignment, JSON correctness, answer relevance, bias, toxicity, and completeness. Ninety cases (70.3%) passed automated filtering and proceeded to expert evaluation by four MSK physiotherapy educators, who rated medical accuracy, realism, fidelity, relevance, and usability on 5-point Likert scales. GPT-4.1 demonstrated the highest automated pass rate (96%) and strongest expert ratings (medical accuracy 4.10/5, usability 4.38/5), while Llama 4 Scout showed the lowest pass rate (33.3%) and expert ratings. Expert-evaluated cases achieved strong content validity indices for usability (97.5%), relevance (97.5%), and realism (95%), though medical accuracy showed greater variance (CVI 87.5%). Cross-layer correlation analysis revealed that automated completeness metrics moderately aligned with expert usability ratings, while answer relevance and prompt alignment showed weak or negative correlations with clinical correctness. Qualitative analysis identified three primary failure modes: reductive logic, biomechanical inconsistency, and administrative/contextual gaps.
The dual-layer validation framework proved methodologically viable: automated screening efficiently reduced expert review burden, while human judgment remained indispensable for detecting subtle clinical reasoning failures. LLM-generated synthetic cases have the potential to meet practical educational needs for MSK physiotherapy, but expert validation is essential to safeguard clinical accuracy. These findings support a scalable division of labour for synthetic case development, with targeted improvements to prompting and automated reasoning checks needed to address identified "nuance gaps." The code for this paper is available at https://github.com/kwid-ai/PhysiCase
Kurt, F.; Subasi, S. N.; Yakisan, E. S.; Subasi, A.
Background: Wearable technologies enable scalable and continuous monitoring of emotional states through passive sensing of physiological and behavioral signals. However, conventional learning approaches often struggle to model the complex temporal, contextual, and relational dependencies underlying human emotions. To address these limitations, we propose a graph-based framework that represents multimodal wearable observations as heterogeneous knowledge graphs enriched with semantic information derived from Large Language Models (LLMs), enabling richer contextual understanding beyond raw sensor measurements. Methods: We constructed a heterogeneous knowledge graph using multimodal Fitbit physiological signals and affective self-report data collected from 45 users. Mood prediction and emotion detection were formulated as both binary and ternary node classification tasks. We evaluated five baseline heterogeneous Graph Neural Network (GNN) architectures and compared them with the proposed Semantically Gated Augmented Graph Neural Network (SeGA-GNN) framework, which dynamically integrates LLM-generated semantic embeddings into graph representations through a gated cross-modal fusion mechanism. Results: The baseline GNN models achieved strong performance, with classification accuracies ranging from 0.7525 to 0.9739 for binary classification and 0.6249 to 0.9699 for ternary classification. The proposed SeGA framework consistently improved predictive performance across most architectures. In particular, semantic augmentation transformed the HAN model from moderate baseline performance into near-perfect emotion recognition capability, achieving SeGA-HAN Accuracy = 0.9988 and AUC = 1.0000 for binary classification and Accuracy = 0.9979 and AUC = 1.0000 for ternary classification.
Discussion and Conclusion: Integrating LLM-derived semantic contextualization into heterogeneous graph learning enables effective modeling of contextual information that is not directly captured by wearable physiological signals alone. The proposed SeGA-GNN framework demonstrates that adaptive semantic fusion substantially improves the accuracy, robustness, and interpretability of wearable-based emotion detection. These findings establish a promising direction for next-generation wearable affective computing systems and intelligent emotion-aware applications.
Landry, T. C.; Kim, Y.
Background. Capillary refill time, an examiner-dependent bedside test of distal microvascular perfusion, has become a resuscitation target in septic shock [1-4], motivating a continuous surrogate computed from the photoplethysmogram (PPG, the optical waveform the pulse oximeter on every ICU patient already records) [5-8]. Objective. We attempted three PPG-derived candidate measures on the MIMIC-IV Waveform Database (MIMIC-IV-WDB v0.1.0) and asked, by inspecting randomly drawn examples, whether each captured its intended physiology before any downstream modeling. Methods. MIMIC-IV-WDB v0.1.0 [9] was linked to MIMIC-IV [10]. The signals were a cuff-anchored perfusion-index recovery (reactive hyperemia when the cuff shares an arm with the probe), a slow Mayer-wave-band power ratio of the perfusion index (sympathetic vasomotor tone), and a per-beat diastolic exponential decay time constant (a refill-like recovery time). For each signal we drew 10 random examples at a fixed seed and checked them against a checklist fixed in advance. Each was read by the author and, separately, by MedGemma 1.5, a multimodal medical language model run locally. A synthetic test with a known time constant checked the third signal. Results. The cuff-anchored signal showed the expected occlusion-reperfusion shape on 268 of 6,236 evaluable cuff cycles (4.30%) in 15 of 19 patients, consistent with opposite-limb placement of the probe and cuff. The slow-band ratio returned a stable cohort value, but a clear, stationary peak appeared in only 4 of 10 random windows. The per-beat fit met its goodness-of-fit threshold in 10 of 10 beats, yet a cardiac-frequency heuristic flagged a possible fit on the heart-rate oscillation in 7 of 10, and in 5 of 17 patients the time constant lay where an exponential is indistinguishable from a straight line. A 0.5 Hz high-pass pre-filter implanted its own approximately 318 ms time constant regardless of truth.
The language model tracked the human on clear positives but reported the pattern present on every call it returned, never absent. Conclusions. Two of the three candidate signals did not reflect their intended physiology in most examples, and the third was constrained by sensor placement. Inspecting a few random raw inputs against a checklist written in advance is an inexpensive upstream check before downstream inference on PPG-derived microvascular signals.
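The roughly 318 ms time constant attributed to the 0.5 Hz high-pass pre-filter is what a first-order (single-pole) filter would produce. Assuming that filter order, which the abstract does not state, the cutoff frequency and time constant are related as follows:

```latex
\tau = \frac{1}{2\pi f_c}
     = \frac{1}{2\pi \times 0.5\ \mathrm{Hz}}
     = \frac{1}{\pi}\ \mathrm{s}
     \approx 0.318\ \mathrm{s}
     = 318\ \mathrm{ms}
```

Under this assumption, any exponential decay fitted to the filtered waveform is pulled toward the filter's own step response, whatever the underlying physiology.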
Wang, Y.; He, H.; Zhu, R.; Lu, Y.; Phadungsaksawasdi, P.; Peng, M.; Liu, Z.; Zou, K.; Zhang, Y.; Chew, S. P.; Tham, Y. C.; Khorasani, A.; Deng, H.; Cheng, C.-Y.; Yang, J.; Liu, D.
Background Patients worldwide receive healthcare in many languages, yet medical AI systems are validated almost exclusively in high-resource languages such as English and Chinese, exposing patients in other linguistic settings to unquantified diagnostic risk. Existing multilingual evaluations rely on translated research-style benchmarks that fail to capture authentic clinical work. We aimed to characterise the patient safety consequences of multilingual medical AI deployment in real-world clinical settings and to develop an auditable detection method for unsafe outputs. Methods We evaluated different language models (LLMs) and visual language models (VLMs) across four real-world clinical tasks (conversational QA, radiology report generation, glaucoma diagnosis, ICU re-intubation prediction) in five languages (English, Chinese, Malay, Thai, Persian). We developed a token-level uncertainty toolkit to localise reasoning instability, compared three inference paradigms (native-language, English chain-of-thought, back-translation pivot), and conducted a prospective study (50 dialogues, 150 physician-reviewed records). Findings LLM/VLM performance degraded consistently from high- to low-resource languages across all tasks. Key gaps included: HealthBench score declining from 0.3743 to 0.3180; radiology macro-F1 from 0.2938 to 0.2149-0.2424, consistent with selective pathology suppression; glaucoma accuracy from 50.7% to 32.7%; ICU parameter recall from 100.0% to 48.5%. Multimodal inputs amplified degradation. Qwen3 VL 235B showed attenuated decline with no resource-ordered pattern in glaucoma classification. Token-level analysis localised instability to mid-chain stages (40-70% of the normalised trajectory); perplexity-based confidence failed to flag errors (AUROC 0.41-0.66). Back-translation pivot consistently restored performance. 
In the prospective study, 98.7% of records required physician edits (overall modification score 53.6%); Thai-pivot correction burden (59.0%) exceeded English-pivot (50.7%, p=0.003) and Chinese-direct (51.0%, p=0.004). Interpretation Multilingual deployment produced clinically consequential failures, including missed pathology, distorted physiological extraction, and amplified multimodal misclassification, that were invisible to monolingual validation and not reliably flagged by model confidence. Pretraining data composition may contribute to multilingual safety risk. Language-specific safety auditing should precede deployment in non-dominant-language healthcare settings; the open-source detection toolkit enables this without model retraining.
Madison, M.; Wheaton, L. A.; Rowe, V.
Background: Occupational therapists can improve stroke survivors' hand and arm movement and participation in daily activities through action observation (AO). AO involves watching another person's hand or arm complete a movement or task. While research generally supports the use of AO with stroke survivors, few AO videos are available to occupational therapists, which makes applying AO challenging. Objective: The purpose of this work is to develop a structured and widely accessible tool to support access to AO for stroke survivors, occupational therapists, and researchers. Methods: To develop an AO video library for stroke rehabilitation, functional and non-functional upper limb task deficits were first identified through clinical observations and clinician interviews to establish a prioritized list of daily activities. In collaboration with media production specialists, healthy adult volunteers were recruited and filmed performing these tasks from both first- and third-person perspectives. The recorded videos were then systematically edited, enhanced with instructional title slides, and distributed via a public YouTube channel for clinical application and a categorized digital repository for research purposes. Results: Initial assessments revealed a complete lack of familiarity, awareness, and utilization of AO resources among local occupational therapists, despite high perceived clinical utility. To address this gap, a final library of 150 tasks was established, resulting in the production of 419 finalized, standardized videos featuring six healthy volunteers. For clinical application, these videos were hosted on a free, public YouTube channel organized into 18 functional playlists, while a parallel set was structured into distinct movement categories for research repository storage.
Conclusion: By providing a structured and highly accessible tool, this repository enables clinicians, researchers, and caregivers to readily implement evidence-based action observation interventions in both clinical and home settings.
Roy, J.; Korleski, J. B.; Augustin, R. C.; Yefet, L.; Jensen, Z. D.; Ehman, E. C.; Zadeh, G.; Conners, A. L.; Tevaarwerk, A. J.; Korfiatis, P.
Background: Preparing tumor board patient summaries is time intensive. Large-language-model based systems may automate summarization but require real-world evaluation prior to clinical use. We performed an exploratory retrospective evaluation of the Microsoft Healthcare Agent Orchestrator (HAO), deployed in a Mayo Clinic controlled staged environment, to generate tumor board-style patient summaries from retrospective Electronic Health Record (EHR) notes. Methods: HAO generated summaries for breast, hepatobiliary, and neuro-oncology tumor board cases using up to the most recent 1,000 clinical notes. Clinician reviewers evaluated outputs via REDCap surveys across perceived factuality, completeness, clarity/conciseness, temporal cohesion, comparative performance, safety, and clinical utility (0-4 Likert scale). Reviewers were permitted to query the HAO chat interface to address missing details. Automated factuality was assessed using TBFact (bidirectional entailment), reporting precision and recall against available reference summaries. Results: Among 57 survey responses from 5 different physicians, mean scores exceeded 2.8 across domains, with medians of 3 for most axes. In an exploratory comparison, oncology fellows required less time to review HAO-generated summaries than to manually generate patient summaries (mean difference 13.57 minutes per patient, p<0.001), although this difference may be influenced by prior familiarity with the same cases; 96% of survey responses indicated that HAO would save time. TBFact evaluations showed higher recall than precision across domains, consistent with broad capture of reference content alongside additional content that was not present in gold-standard summaries. Attribution was viewed favorably but showed issues with primary-source specificity and link reliability. Conclusions: In a controlled Mayo environment, HAO demonstrated moderate performance and was associated with reduced review time for tumor board preparation. 
These findings are promising but preliminary and do not establish clinical safety, noninferiority to manual review, or readiness for routine clinical use. Limitations, including verbosity, specialty-specific content gaps, and inconsistent attribution, highlight the need for iterative refinement and further evaluation.
Shah, K. P.; Airan Javia, S.; Savage, T.; Bressman, E.
End-of-rotation handoffs are critical for patient safety but add to documentation burden for hospitalists. Generative artificial intelligence (AI) may help automate handoff creation using electronic health record data, but its impact on quality and safety is unclear. Methods: We developed an AI handoff tool with a large language model using clinical notes as input and conducted a retrospective evaluation comparing AI-generated and clinician-authored handoffs. Handoffs were assessed across domains of quality and safety through a structured review. Results: Quality ratings were similar between AI and human handoffs (3.7 vs. 3.5, p=0.57). AI-generated handoffs were rated higher for organization (4.4 vs. 4.1, p=0.05) and completeness (4.1 vs. 3.6, p=0.01), but lower for conciseness (3.7 vs. 4.1, p=0.03) and accuracy (4.1 vs. 4.4, p=0.03). Error rates were comparable (0.3/handoff in both groups); however, AI-generated handoffs included inaccuracies (9% of AI errors) and hallucinations (1% of AI errors), while clinician-authored handoffs contained only omissions. Conclusion: Human and AI handoffs have differing error profiles and tradeoffs between completeness and conciseness. Prospective evaluation in clinical workflows is underway.
Gunsilius, C. Z.; Pei, P.; Carayannopoulos, A.; Petzschner, F. H.
Ecological momentary assessment (EMA) enables real-time, longitudinal measurement of symptoms and behavior via smartphones, yet nearly all feasibility evidence comes from protocols lasting one to two weeks, far shorter than the timescales over which chronic diseases fluctuate and clinical decisions unfold. Whether daily compliance can be sustained over months, or whether it decays as short-protocol trends predict, is unknown. Here, 214 participants (173 with pain, 41 healthy controls) completed a 4-month (122-day) EMA protocol via the Soma smartphone app, generating 26,907 check-ins. Half the sample completed the full protocol without a two-week lapse. Aggregate compliance appeared moderate (50%), but this conflated two distinct phenomena: when recomputed over each participant's active period, compliance rose to 71%, with 91% achieving moderate-to-high adherence, and remained stable across all 17 study weeks. Pain status predicted earlier disengagement but not lower compliance among those who remained; after adjustment for differential retention, group differences disappeared. To our knowledge, this is the longest continuous daily EMA evaluation in a clinical population. It suggests the primary barrier to long-duration EMA is not declining motivation among active participants but concentrated early disengagement, with direct implications for the design of digital health protocols, decentralized trials, and remote symptom monitoring.
Jovanova, M.; Bruegger, V.; Svirhrova, R.; Fuchs, M.; Jin, Q.; Wortmann, F.; Mitter, M.; Bechny, M.
One in four adults has insulin resistance (IR), a modifiable driver of type-2 diabetes that can precede diagnosis by a decade. However, IR assessment remains clinic- and laboratory-based, limiting repeated population screening. We tested whether free-living wearable data can detect IR in adults with normoglycemia or prediabetes. Machine-learning models using continuous glucose monitor (CGM)-based glucose dynamics and smartwatch-based heart rate/heart rate variability were developed in Study 1 (N = 97) and externally validated without retraining in Study 2 (N = 61, 31% IR prevalence). The best-performing CGM-based model achieved AU-ROC = 0.873 [0.756-0.967] and AU-PRC = 0.816 [0.640-0.934], outperforming an anthropometrics-only baseline (AU-ROC = 0.749, AU-PRC = 0.593). This is the first demonstration of IR detection from wearables without blood tests or structured glucose challenges, with performance comparable to the state of the art. By enabling continuous at-home screening, this approach can identify undetected at-risk individuals and trigger confirmatory blood tests to close detection gaps.
Bressman, E.; Auerbach, A.; Keniston, A.; Jens, C.; Ranji, S.
Introduction: The use of artificial intelligence (AI) by clinicians has increased rapidly in recent years, with large language models (LLMs) emerging as tools that can equal clinician diagnostic performance in simulated settings. However, limited data exist regarding physicians' use of LLMs in real-world clinical practice. This study aimed to evaluate the frequency of LLM use among practicing hospitalists, identify which LLMs are most commonly utilized, and assess hospitalists' perceptions of the benefits and limitations of LLM use in clinical care. Methods: We conducted a cross-sectional survey study of academic hospital medicine faculty across 8 institutions within the Hospital Medicine Reengineering Network (HOMERuN), a collaborative research consortium. Eligible participants included hospitalists practicing within participating HOMERuN sites during the study period. The survey assessed the frequency of LLM use, types of LLMs used, clinical applications, and physician perceptions regarding usefulness, efficiency, and concerns associated with LLM adoption. Results: 170 respondents (67.1%) reported ever using an LLM in clinical practice. Among LLM users, OpenEvidence was the most used tool (88.9%), followed by ChatGPT (58.5%), Google Gemini (26.9%), and Microsoft Copilot (20.5%). Only a minority of hospitalists reported using LLMs daily while seeing patients. The most common use cases of LLMs were answering diagnostic (77.1%) and management (77.6%) questions. A majority also reported using LLMs to identify or summarize primary literature (60.0%). Lack of trust in outputs (49.8%), uncertainty around institutional policies (48.6%), and lack of access to secure applications (43.1%) were cited as the most frequent barriers to using LLMs in practice. Discussion: The use of LLMs in clinical practice is already widespread, though regular or daily use is not yet typical.
Concerns regarding reliability, patient privacy, and safe integration into clinical workflows remain significant barriers to broader adoption. The responsible implementation of LLMs in hospital medicine will require addressing these barriers.